Skip to content

Exclude DataNodes being removed from new Region allocation#17934

Merged
CRZbulabula merged 4 commits into
masterfrom
fix_v2_990_alloc_excludes_removing_dn
Jun 15, 2026
Merged

Exclude DataNodes being removed from new Region allocation#17934
CRZbulabula merged 4 commits into
masterfrom
fix_v2_990_alloc_excludes_removing_dn

Conversation

@CRZbulabula

Copy link
Copy Markdown
Contributor

Problem

When a remove datanode is in progress, the ConfigNode could still allocate brand-new Region replicas onto the DataNode being removed.

This is especially likely when the target DataNode was killed (e.g. kill -9) before the removal: the failure detector reports such a node as Unknown rather than Removing, and RegionBalancer intentionally keeps Unknown DataNodes as allocation candidates (to cope with an insufficient number of online nodes). As a result, a new region (e.g. for the internal root.__system database) could be assigned a replica on the dead node. That replica can never be created there (Connection refused), yet the metadata keeps the assignment and retries forever, so the RemoveDataNodesProcedure gets stuck and the target DataNode never disappears from show datanodes.

Observed timeline (from the report):

11:31:06.210  Submit RemoveDataNodesProcedure successfully
11:31:06.550  create SchemaRegion 8 on DataNodes [3, 7, 6]   # DN7 is being removed
11:31:07.444  create DataRegion 9 on DataNodes [5, 7]        # DN7 is being removed
              Failed to CREATE_*_REGION on DataNode 7: Connection refused  (retried forever)

Root cause

RegionBalancer.genRegionGroupsAllocationPlan gathers allocation candidates with:

getNodeManager().filterDataNodeThroughStatus(NodeStatus.Running, NodeStatus.Unknown)

A node-status filter alone cannot fix this: a DataNode killed before removal is Unknown (not Removing), because DataNodeHeartbeatCache.updateCurrentStatistics lets the failure detector override the Removing status back to Unknown once heartbeats stop. So the removing node still passes the filter.

Fix

Consult the in-progress RemoveDataNodesProcedure — the authoritative, leader-switch-durable source of which DataNodes are being removed — and exclude those nodes from the allocation candidates.

  • Add ProcedureManager.getRemovingDataNodeIds(), which scans the unfinished RemoveDataNodesProcedure(s) and returns the removing DataNode ids. This mirrors the existing pattern in ProcedureManager.checkRemoveDataNodes / checkRegionOperationWithRemoveDataNode.
  • RegionBalancer.genRegionGroupsAllocationPlan now filters those ids out of the candidate list, in the same spirit as the existing RemoveDataNodeHandler.selectedRegionMigrationPlans.

This keeps the existing "allow Unknown candidates when online nodes are scarce" behavior intact, and only removes nodes that are actively being removed. The removal procedure already guarantees (checkEnoughDataNodeAfterRemoving) that enough non-removing nodes remain, so this never spuriously triggers NotEnoughDataNodeException.

Test

IoTDBRemoveDataNodeRegionAllocationIT kills a DataNode that hosts regions, submits the removal, and — while the removal is still in progress — forces a fresh Region allocation by creating a new database. It asserts that none of the newly allocated regions land on the DataNode being removed (comparing against a pre-allocation snapshot of region ids, since the removing node legitimately keeps hosting its own pre-existing regions until each finishes migrating away), and that the removal then completes.

🤖 Generated with Claude Code

When a `remove datanode` is in progress, the ConfigNode could still
allocate brand-new Region replicas onto the DataNode being removed.
This was especially likely when the target DataNode had been killed
(e.g. `kill -9`) before removal: the failure detector reports such a
node as `Unknown` rather than `Removing`, and `RegionBalancer`
intentionally keeps `Unknown` DataNodes as allocation candidates (to
cope with insufficient online nodes). The new replica could never be
created on the dead node (Connection refused), yet the metadata kept
the assignment and retried forever, so the removal hung and the target
DataNode never disappeared from `show datanodes`.

A node-status filter alone cannot fix this, because the killed node is
`Unknown`, not `Removing`. Instead, `RegionBalancer` now consults the
in-progress `RemoveDataNodesProcedure` (the authoritative, leader-switch
durable source of which DataNodes are being removed) via the new
`ProcedureManager.getRemovingDataNodeIds()` and drops those DataNodes
from the allocation candidates. This mirrors the existing filtering in
`RemoveDataNodeHandler.selectedRegionMigrationPlans`.

Add IoTDBRemoveDataNodeRegionAllocationIT: it kills a DataNode, submits
the removal, and while it is in progress forces a fresh Region
allocation via a new database, asserting that none of the newly
allocated Regions land on the DataNode being removed and that the
removal completes.
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.06%. Comparing base (abb9ef9) to head (9fb2e92).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...nfignode/manager/load/balancer/RegionBalancer.java 0.00% 8 Missing ⚠️
...che/iotdb/confignode/manager/ProcedureManager.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17934      +/-   ##
============================================
- Coverage     41.07%   41.06%   -0.01%     
  Complexity      318      318              
============================================
  Files          5257     5257              
  Lines        365010   365023      +13     
  Branches      47180    47180              
============================================
- Hits         149918   149910       -8     
- Misses       215092   215113      +21     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Caideyipi Caideyipi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the change and did not find any blocking issue. The fix correctly excludes DataNodes covered by an unfinished RemoveDataNodesProcedure from new Region allocation while preserving the existing Unknown-node candidate behavior.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents ConfigNode from allocating new Region replicas onto DataNodes that are currently being removed (even if they appear as Unknown due to heartbeat loss), avoiding stranded replicas that can stall RemoveDataNodesProcedure indefinitely.

Changes:

  • Add ProcedureManager.getRemovingDataNodeIds() to derive the authoritative “removing DataNode” set from in-progress RemoveDataNodesProcedure instances.
  • Update RegionBalancer.genRegionGroupsAllocationPlan to filter out those removing DataNodes from region-allocation candidates while still allowing Unknown nodes in general.
  • Add an integration test that kills a DataNode, starts removal, forces new Region allocation, and asserts newly allocated Regions do not land on the removing DataNode and that removal completes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/ProcedureManager.java Exposes removing-DataNode IDs by scanning in-progress RemoveDataNodesProcedures.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/balancer/RegionBalancer.java Excludes removing DataNodes from allocation candidates to prevent allocating new replicas onto nodes being removed.
integration-test/src/test/java/org/apache/iotdb/confignode/it/removedatanode/IoTDBRemoveDataNodeRegionAllocationIT.java Adds regression coverage for “new Region allocation during remove datanode” scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The test re-opened a multi-node connection after kill -9'ing the target
DataNode. EnvFactory.getConnection() fans every read out to all DataNodes
including the dead one, so each executeQuery returned
[Connection Error, null, null, null] and failed with
InconsistentDataException at the connection site rather than exercising
the bug. Pin both the post-kill connection and the allocation-probe
connection to a surviving DataNode via a new
BaseEnv.getConnection(DataNodeWrapper) overload.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Fixed the CI failure in `IoTDBRemoveDataNodeRegionAllocationIT` (e75e196).

Root cause: after `kill -9`-ing the target DataNode, the test re-opened a multi-node `EnvFactory.getConnection()`. That connection fans every read out to all DataNodes, including the dead one, so each `executeQuery` (`SHOW DATANODES`, region maps) returned `[Connection Error, null, null, null]` and failed with `InconsistentDataException` at the connection site — before the test could exercise the allocation behaviour.

Fix: pin all post-kill SQL (both the main statement and the allocation-probe connection) to a surviving DataNode via a new `BaseEnv.getConnection(DataNodeWrapper)` overload, which scopes write and read to a single live node. This matches the existing framework intent ("This is useful when you shut down a dataNode").

The remaining CI reds (`IoTDBConnectionsIT` on Windows, `IoTDBPipeClusterIT` on dual-table-manual-enhanced) are unrelated pre-existing flakes — they touch the session and pipe subsystems, not region allocation.

With Ratis schema consensus and schemaReplicationFactor=2, kill -9'ing one
of the two schema-replica holders breaks Ratis quorum: the schema-region
migration off the dead node fails (DO_ADD_REGION_PEER times out), so
RemoveDataNodesProcedure rolls back and the node never leaves the cluster,
making awaitUntilSuccess time out after 5 minutes.

Raise schemaReplicationFactor to 3 (matching the proven
IoTDBRemoveUnknownDataNodeIT#successTest config) so one kill still leaves a
quorum (2 of 3) and the removal can actually complete. With 4 DataNodes,
removing 1 leaves exactly MINIMUM_DATANODE = max(schemaRF, dataRF) = 3, so
the removal is still permitted and the allocation candidate pool (the 3
survivors) is large enough for the forced new RF=3 schema / RF=2 data
regions.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Second CI fix (9fb2e92).

After the connection fix, the IT got past the connection step and the core assertion (`assertNewRegionsExcludeDataNode` — new Regions exclude the removing node) passed, but it then timed out at `awaitUntilSuccess`: the removal never completed.

Root cause (from the ConfigNode cluster logs): schema Regions use Ratis, and with `schemaReplicationFactor=2`, `kill -9`-ing one of the two replica holders breaks Ratis quorum. The schema-Region migration off the dead node (`AddRegionPeerProcedure.DO_ADD_REGION_PEER`) times out because the old 2-peer config (one dead) can't reach majority, so `RemoveDataNodesProcedure` rolls back (`migratedFailedRegions:[SchemaRegion id:0]`) and the node stays in the cluster. This is correct product behaviour — it was a test-config mistake.

Fix: raise `schemaReplicationFactor` to 3 (matching the proven `IoTDBRemoveUnknownDataNodeIT#successTest` config). One kill then leaves a 2-of-3 quorum so the schema Region can migrate and the removal completes. With 4 DataNodes, removing 1 leaves exactly `MINIMUM_DATANODE = max(schemaRF, dataRF) = 3`, so removal is still permitted and the candidate pool (3 survivors) is large enough for the forced new RF=3 schema / RF=2 data Regions.

@sonarqubecloud

Copy link
Copy Markdown

@CRZbulabula CRZbulabula merged commit 08c55b9 into master Jun 15, 2026
43 of 44 checks passed
@CRZbulabula CRZbulabula deleted the fix_v2_990_alloc_excludes_removing_dn branch June 15, 2026 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants